@jan-elastic
Contributor

No description provided.

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.1.0 labels May 2, 2025
@jan-elastic jan-elastic force-pushed the esql-sample-agg-2 branch from 55fe96f to 62de767 Compare May 2, 2025 13:23
@jan-elastic jan-elastic added >feature :ml Machine learning Team:ML Meta label for the ML team labels May 2, 2025
@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label May 2, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Collaborator

Hi @jan-elastic, I've created a changelog YAML for you.

@jan-elastic jan-elastic requested a review from alex-spies May 2, 2025 15:36
"version" },
description = "Collects sample values for a field.",
type = FunctionType.AGGREGATE,
examples = @Example(file = "stats_sample", tag = "doc")
Member

I think we should have an example of the output in the docs. I'm not entirely sure of the right way to hack that one up because it's non-deterministic. Maybe it's hand-rolled.

I think we want that example because my first question when reading this is "can I get duplicates or do those count as distinct samples?" Mostly because I'm not good at statistics.

I do think it's interesting that SAMPLE(bool) is strictly more work than VALUES(bool). It feels like sampling shouldn't be, but it makes some sense.

Contributor Author

I think it makes sense that SAMPLE(bool) is more work. VALUES(bool) just keeps track of two boolean values: does true exist and does false exist. SAMPLE(bool) does more.
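
To make the contrast concrete, here is a minimal sketch of the two kinds of state. It is not the code from this PR: the class names are hypothetical, and the sampling scheme (tag every row with a random key, keep the `limit` smallest keys) is an assumption suggested by the bucketed sort and `limit` in the constructor under review.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.SplittableRandom;

/** VALUES(bool)-style state: two flags suffice, however many rows arrive. */
final class BoolValuesSketch {
    private boolean seenTrue;
    private boolean seenFalse;

    void add(boolean value) {
        if (value) {
            seenTrue = true;
        } else {
            seenFalse = true;
        }
    }

    List<Boolean> result() {
        List<Boolean> out = new ArrayList<>();
        if (seenFalse) out.add(false);
        if (seenTrue) out.add(true);
        return out;
    }
}

/** SAMPLE(bool, limit)-style state (assumed scheme): tag each row with a random key, keep the `limit` smallest keys. */
final class BoolSampleSketch {
    private record Keyed(long key, boolean value) {}

    private final int limit;
    private final SplittableRandom random;
    // Max-heap on the key, so poll() evicts the largest key once the heap overflows.
    private final PriorityQueue<Keyed> heap = new PriorityQueue<Keyed>((a, b) -> Long.compare(b.key(), a.key()));

    BoolSampleSketch(int limit, long seed) {
        this.limit = limit;
        this.random = new SplittableRandom(seed);
    }

    void add(boolean value) {
        heap.add(new Keyed(random.nextLong(), value));
        if (heap.size() > limit) {
            heap.poll();
        }
    }

    List<Boolean> result() {
        List<Boolean> out = new ArrayList<>();
        for (Keyed k : heap) {
            out.add(k.value());
        }
        return out;
    }
}
```

In this sketch each input row is drawn at most once, so a value repeats in the output only if it repeats in the input. Whether SAMPLE itself behaves that way is not settled by this thread.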

Contributor Author

@jan-elastic jan-elastic May 6, 2025

Obv, I prefer some example output too. I didn't know how to achieve that, but I'll think of something. Should've left a TODO.

Contributor Author

OK, I've hacked up something. Not particularly proud of it, but it gets the job done.

@jan-elastic jan-elastic force-pushed the esql-sample-agg-2 branch 2 times, most recently from 717536b to 3c4bae7 Compare May 6, 2025 10:42
this.breaker = bigArrays.breakerService().getBreaker(CircuitBreaker.REQUEST);
this.sort = new BytesRefBucketedSort(breaker, "sample", bigArrays, SortOrder.ASC, limit);
this.bytesRefBuilder = new BreakingBytesRefBuilder(breaker, "sample");
this.random = new SplittableRandom();
Contributor Author

Some notes on this SplittableRandom:

  • If I replace it by Random, I get the precommit error

Forbidden method invocation: java.util.Random#<init>() [Use org.elasticsearch.common.Randomness#get for reproducible sources of randomness]

Using SplittableRandom instead works around this by not being on the blacklist, but that's not in the spirit of what's intended.

  • If I replace it by Randomness.get() in the constructor, I get:

java.lang.IllegalStateException: This Random was created for/by another thread (Thread[#39,TEST-SampleLongAggregatorFunctionTests.testManyInitialManyPartialFinalRunner-seed#[B3B51719A90700AD],5,TGRP-SampleLongAggregatorFunctionTests]). Random instances must not be shared (acquire per-thread). Current thread: Thread[#51,elasticsearch[test][esql_test_executor][T#1],5,TGRP-SampleLongAggregatorFunctionTests]

Even though the Aggregator is used on a single thread (I hope; otherwise there are more issues), it's created on a different thread than the thread that's actively using it.

  • If I replace it by Randomness.get() inside the add method, the test SampleLongAggregatorFunctionTests::testDistribution fails. It looks like each iteration instantiates the same random generator (same seed), leading to the statistics being completely wrong.

I'm still looking into these issues. If you have any thoughts, let me know.
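
The third bullet can be reproduced in isolation, assuming the diagnosis is right that every call returned a generator with the same seed. In the sketch below, `freshGenerator` is a hypothetical stand-in for such a source, not the real `Randomness.get()`. Re-acquiring the generator per value repeats the first draw of one sequence, while reusing a single instance advances through the sequence.

```java
import java.util.Random;
import java.util.stream.LongStream;

final class SameSeedDemo {
    /** Hypothetical source that hands out a new generator with the same seed on every call. */
    static Random freshGenerator(long seed) {
        return new Random(seed);
    }

    /** Anti-pattern: a new generator per value, so every draw is the first draw of the same sequence. */
    static long distinctDrawsWhenReacquired(int n, long seed) {
        return LongStream.range(0, n).map(i -> freshGenerator(seed).nextLong()).distinct().count();
    }

    /** One generator reused for all values, so the draws advance through the sequence. */
    static long distinctDrawsWhenReused(int n, long seed) {
        Random random = freshGenerator(seed);
        return LongStream.range(0, n).map(i -> random.nextLong()).distinct().count();
    }
}
```

With identical keys for every row, a key-based sampler has nothing random to choose by, which would be consistent with the skewed statistics seen in testDistribution.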

Member

OK! The Lucene stuff has stuff like:

        if (Thread.currentThread() != prevThread) {
            prevThread = Thread.currentThread();
            random = Randomness.get();
        }

That might do.
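
A self-contained sketch of that pattern, for illustration only: `PerThreadRandomHolder` and its `Supplier<Random>` parameter are hypothetical names, with the supplier standing in for `Randomness::get`. It assumes the holder is used by one thread at a time, as described above, and is not safe for concurrent use.

```java
import java.util.Random;
import java.util.function.Supplier;

final class PerThreadRandomHolder {
    private final Supplier<Random> source; // hypothetical stand-in for Randomness::get
    private Thread prevThread;
    private Random random;

    PerThreadRandomHolder(Supplier<Random> source) {
        this.source = source;
    }

    /** Returns the cached generator, re-acquiring it only when the calling thread has changed. */
    Random get() {
        Thread current = Thread.currentThread();
        if (current != prevThread) {
            prevThread = current;
            random = source.get();
        }
        return random;
    }
}
```

Because the generator is acquired lazily on first use rather than in the constructor, creating the holder on one thread and using it on another does not share an instance across threads. Repeated calls on the same thread keep advancing one generator instead of starting over.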

Contributor Author

Not sure exactly what you mean by this. Where exactly is this code?

Contributor Author

I also whipped up a different fix. Let me know what you think...

@jan-elastic jan-elastic force-pushed the esql-sample-agg-2 branch from 3c4bae7 to d46b18f Compare May 6, 2025 11:21
@jan-elastic jan-elastic requested a review from nik9000 May 6, 2025 12:20
@jan-elastic jan-elastic force-pushed the esql-sample-agg-2 branch from d46b18f to 520087d Compare May 6, 2025 14:51
@jan-elastic jan-elastic requested review from ivancea and removed request for alex-spies May 7, 2025 08:17
Contributor

@ivancea ivancea left a comment

LGTM! Added some questions and things to check 👀

@jan-elastic jan-elastic force-pushed the esql-sample-agg-2 branch from 520087d to 940b6f4 Compare May 7, 2025 12:20
@jan-elastic jan-elastic requested a review from ivancea May 7, 2025 12:26
@jan-elastic jan-elastic added the ES|QL-ui Impacts ES|QL UI label May 7, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/kibana-esql (ES|QL-ui)

Contributor

@ivancea ivancea left a comment

🚀

@jan-elastic jan-elastic force-pushed the esql-sample-agg-2 branch from 865c90c to c1dc208 Compare May 7, 2025 14:13
@jan-elastic jan-elastic requested a review from a team as a code owner May 7, 2025 14:13
@jan-elastic jan-elastic merged commit 9cf2a64 into elastic:main May 8, 2025
17 checks passed
@jan-elastic jan-elastic deleted the esql-sample-agg-2 branch May 8, 2025 06:02
ywangd pushed a commit to ywangd/elasticsearch that referenced this pull request May 9, 2025
* ES|QL SAMPLE aggregation function

* [CI] Auto commit changes from spotless

* ThreadLocalRandom -> SplittableRandom

* Update docs/changelog/127629.yaml

* fix yaml test

* Add SampleTests

* docs + example

* polish code

* mark generated imports

* comment with algorith description

* use Randomness.get()

* close properly

* type checks

* reuse hash

* regen some files

* [CI] Auto commit changes from spotless

---------

Co-authored-by: elasticsearchmachine <[email protected]>
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request May 12, 2025
jan-elastic added a commit to jan-elastic/elasticsearch that referenced this pull request Jun 18, 2025
jan-elastic added a commit to jan-elastic/elasticsearch that referenced this pull request Jun 18, 2025
jan-elastic added a commit to jan-elastic/elasticsearch that referenced this pull request Jun 18, 2025
jan-elastic added a commit to jan-elastic/elasticsearch that referenced this pull request Jun 19, 2025
jan-elastic added a commit that referenced this pull request Jun 19, 2025
* ES|QL SAMPLE aggregation function (#127629)

* Fix + unmute SampleTests (#127959)

* Fix memory tracking of ES|QL sample agg (#128467)

* Fix memory tracking of ES|QL sample agg

* [CI] Auto commit changes from spotless

* polish code

---------

Co-authored-by: elasticsearchmachine <[email protected]>

* ESQL: Unclean generated imports (#127723)

This removes a ton of the tricky juggling we do for generated java files
to keep the imports in order. Instead, we just live with them being out
of order a little. It's not great, but it's so so so much easier than
the terrible juggling we had been doing.

* ESQL: Disable format checks on generated imports (#127648)

This builds the infrastructure to disable spotless and some checkstyle
rules on generated imports. This works around the most frustrating part
of ESQL's string template generated files - the imports. It allows
unused and out of order imports. This can let us remove a lot of
cumbersome, tricky, and fairly useless `$if$` blocks from the templates.

---------

Co-authored-by: elasticsearchmachine <[email protected]>
Co-authored-by: Nik Everett <[email protected]>
@leemthompo
Contributor

@jan-elastic just a reminder here, that we'll need to add applies_to tags to this function eventually when it gets added per README.md#version-differentiation-in-docs-v3 :)

@jan-elastic
Contributor Author

@leemthompo Are you fixing this as well together with the other fixes? Thanks!

@leemthompo
Contributor

@jan-elastic yup sure can do, what's the correct availability info for this?

@jan-elastic
Contributor Author

jan-elastic commented Jul 24, 2025

> @jan-elastic yup sure can do, what's the correct availability info for this?

Thanks!

The SAMPLE command did not exist in 9.0, and is GA in 9.1.

leemthompo added a commit to leemthompo/elasticsearch that referenced this pull request Jul 24, 2025
leemthompo added a commit to leemthompo/elasticsearch that referenced this pull request Jul 24, 2025
leemthompo added a commit that referenced this pull request Jul 28, 2025
* Add applies to to ScalB function in #127696

* Add applies_to to categorize, follow up to #129398

* Add version info, following #127629

* SAMPLE is new + GA in 9.1 #127629

* add applies to for 9.2 option
leemthompo added a commit to leemthompo/elasticsearch that referenced this pull request Jul 28, 2025
(cherry picked from commit 5d565b5)

# Conflicts:
#	docs/reference/query-languages/esql/_snippets/functions/parameters/categorize.md
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/grouping/Categorize.java
elasticsearchmachine pushed a commit that referenced this pull request Jul 28, 2025
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Jul 28, 2025